L17: Panel regression intro

# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm
Preliminaries
Lecture overview
To make the concepts in these lecture notes (and the next) more concrete, we will apply them to the empirical analysis of whether firm profitability is predictable. To be more specific, we will ask:
Which of the following firm characteristics (if any) have statistically significant predictive power over firms’ profitability: the firm’s cash holdings, its book leverage or its capital investments?
In this lecture, we will start by collecting the data we need for this analysis and producing some key descriptive statistics. We will then perform a regression analysis where future firm profitability is the dependent variable, and firm cash holdings, book leverage, and investment are the explanatory variables. Along the way, we will showcase some common practical issues that one needs to be aware of any time they run a regression:

1. The effect of outliers on our regression results
2. The difference between economic magnitude and statistical significance
3. Multicollinearity (highly correlated independent variables)
In the following lecture, we will continue this analysis by tackling two other very common issues with regression analysis:

1. The potential presence of “fixed effects” in the data
2. The issue of correlated error terms in the regression
The data
The first step in our analysis is to decide exactly what data we will use to try to answer this question. We will use the Compustat dataset as raw data (the “compa.zip” file in the “data” folder).
We then have to be explicit about exactly how each variable in the analysis will be calculated:

- Dependent variable:
  - roa = net income before extraordinary items (ib) divided by total assets (at)
- Independent variables:
  - cash holdings = cash and cash equivalents (che) divided by total assets
  - book leverage = long-term debt (dltt) divided by total assets
  - capital expenditures = change in net PP&E (ppent) divided by total assets
Note that all the variables are scaled by some measure of size (total assets). This is to ensure that our regression results are not dominated by large firms. It also helps avoid heteroskedasticity problems (the dollar-amount data for large firms is much more volatile than the data for small firms).
# Load the cleaned compa data, keeping only what we need
comp = pd.read_pickle('../data/compa.zip')
comp = comp.loc[comp['at']>0, ['permno','datadate','ib','at','che','dltt','ppent','sich']].copy()
comp = comp.sort_values(['permno','datadate'])
comp.dtypes
permno float64
datadate object
ib float64
at float64
che float64
dltt float64
ppent float64
sich float64
dtype: object
# Create main variables
comp['year'] = pd.to_datetime(comp['datadate']).dt.year
comp['roa'] = comp['ib'] / comp['at']
comp['future_roa'] = comp.groupby('permno')['roa'].shift(-1)
comp['cash'] = comp['che'] / comp['at']
comp['leverage'] = comp['dltt'] / comp['at']
comp['investment'] = comp.groupby('permno')['ppent'].diff() / comp['at']
main_vars = ['future_roa','cash','leverage','investment']
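As a quick sanity check on the groupby-shift logic above (a toy sketch with made-up numbers), note that shift(-1) within each permno pulls next year's roa into the current row and leaves a missing value in each firm's last year, so future_roa never leaks across firms:

# Toy two-firm panel (hypothetical data) to illustrate the shift
toy = pd.DataFrame({'permno': [1, 1, 1, 2, 2],
                    'year':   [2000, 2001, 2002, 2000, 2001],
                    'roa':    [0.10, 0.12, 0.08, -0.05, 0.02]})
toy['future_roa'] = toy.groupby('permno')['roa'].shift(-1)
print(toy)   # the last row of each permno has future_roa = NaN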
# Winsorize main variables at the 1 and 99 percentiles
main_vars_wins = []
for v in main_vars:
    comp[f'w_{v}'] = comp[v].clip(lower=comp[v].quantile(0.01), upper=comp[v].quantile(0.99))
    main_vars_wins = main_vars_wins + [f'w_{v}']
# Add a constant and save the data for next time
comp['const'] = 1
comp.to_pickle('../data/comp_clean.zip')
Descriptive statistics
Summarize the main variables, both winsorized and unwinsorized:
comp[main_vars_wins].describe()
|       | w_future_roa | w_cash | w_leverage | w_investment |
|-------|--------------|--------|------------|--------------|
| count | 212955.000000 | 236417.000000 | 235969.000000 | 207458.000000 |
| mean  | -0.044766 | 0.174446 | 0.176818 | 0.014380 |
| std   | 0.254165 | 0.218750 | 0.195127 | 0.089620 |
| min   | -1.482391 | 0.000181 | 0.000000 | -0.400194 |
| 25%   | -0.029755 | 0.025310 | 0.008017 | -0.006815 |
| 50%   | 0.019406 | 0.080652 | 0.112543 | 0.003152 |
| 75%   | 0.061315 | 0.234988 | 0.287561 | 0.033348 |
| max   | 0.264823 | 0.930186 | 0.855270 | 0.361891 |
comp[main_vars].describe()
|       | future_roa | cash | leverage | investment |
|-------|------------|------|----------|------------|
| count | 212955.000000 | 236417.000000 | 235969.000000 | 207458.000000 |
| mean  | -0.054115 | 0.174782 | 0.180717 | -0.035447 |
| std   | 1.568233 | 0.219955 | 0.248598 | 11.025308 |
| min   | -218.000000 | -0.069190 | 0.000000 | -4110.714102 |
| 25%   | -0.029755 | 0.025310 | 0.008017 | -0.006815 |
| 50%   | 0.019406 | 0.080652 | 0.112543 | 0.003152 |
| 75%   | 0.061315 | 0.234988 | 0.287561 | 0.033348 |
| max   | 226.310379 | 1.000000 | 39.592899 | 0.999933 |
# Boxplots of the winsorized and non-winsorized variables
comp[main_vars_wins].boxplot();
comp[main_vars].boxplot();
Check correlations:
comp[main_vars_wins].corr()
|              | w_future_roa | w_cash | w_leverage | w_investment |
|--------------|--------------|--------|------------|--------------|
| w_future_roa | 1.000000 | -0.324233 | 0.063734 | 0.076357 |
| w_cash       | -0.324233 | 1.000000 | -0.315295 | -0.070145 |
| w_leverage   | 0.063734 | -0.315295 | 1.000000 | 0.074024 |
| w_investment | 0.076357 | -0.070145 | 0.074024 | 1.000000 |
We can take a look at pairwise scatter plots (to visualize these correlations) using pd.plotting.scatter_matrix():

pd.plotting.scatter_matrix(comp[main_vars_wins], figsize=(10,10));
These plots help us realize that point statistics (single numbers) like the correlation between profitability and cash holdings can mask how rich the data truly is, and can make us believe that patterns in the data (the -0.32 correlation seems quite strong) are more robust than they really are. Always look at your data (plot it). Just make sure you do it after you mitigate the effect of outliers, or the plots will look very distorted.
Linear regression
Let’s use the non-winsorized data first for our baseline regression:
# Using non-winsorized data
results_base = sm.OLS(endog = comp['future_roa'],
                      exog = comp[['const','cash','leverage','investment']],
                      missing = 'drop').fit()
print(results_base.summary())
OLS Regression Results
==============================================================================
Dep. Variable: future_roa R-squared: 0.001
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 65.43
Date: Fri, 25 Feb 2022 Prob (F-statistic): 2.82e-42
Time: 14:32:28 Log-Likelihood: -3.4663e+05
No. Observations: 185315 AIC: 6.933e+05
Df Residuals: 185311 BIC: 6.933e+05
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0144 0.006 2.327 0.020 0.002 0.027
cash -0.2424 0.018 -13.340 0.000 -0.278 -0.207
leverage -0.1111 0.018 -6.133 0.000 -0.147 -0.076
investment -0.0013 0.000 -3.528 0.000 -0.002 -0.001
==============================================================================
Omnibus: 560269.038 Durbin-Watson: 0.913
Prob(Omnibus): 0.000 Jarque-Bera (JB): 887102384117.364
Skew: 43.457 Prob(JB): 0.00
Kurtosis: 10721.226 Cond. No. 55.6
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The effect of outliers
Now let’s use the winsorized variables and look at the difference. Check the coefficient on the investment variable in particular.
# Using winsorized data
results_wins = sm.OLS(endog = comp['w_future_roa'],
                      exog = comp[['const','w_cash','w_leverage','w_investment']],
                      missing = 'drop').fit()
print(results_wins.summary())
OLS Regression Results
==============================================================================
Dep. Variable: w_future_roa R-squared: 0.096
Model: OLS Adj. R-squared: 0.096
Method: Least Squares F-statistic: 6524.
Date: Fri, 25 Feb 2022 Prob (F-statistic): 0.00
Time: 14:32:28 Log-Likelihood: 5498.7
No. Observations: 185315 AIC: -1.099e+04
Df Residuals: 185311 BIC: -1.095e+04
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 0.0314 0.001 32.312 0.000 0.030 0.033
w_cash -0.3712 0.003 -134.431 0.000 -0.377 -0.366
w_leverage -0.0727 0.003 -23.971 0.000 -0.079 -0.067
w_investment 0.1634 0.006 25.736 0.000 0.151 0.176
==============================================================================
Omnibus: 126392.896 Durbin-Watson: 1.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1848286.569
Skew: -3.171 Prob(JB): 0.00
Kurtosis: 17.112 Cond. No. 12.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
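To see the difference at a glance (a short sketch using the two fitted results objects above), compare the two investment coefficients directly:

# The coefficient on investment flips from negative to positive
# once the extreme outliers are winsorized away
print(results_base.params['investment'])    # -0.0013 (non-winsorized)
print(results_wins.params['w_investment'])  #  0.1634 (winsorized)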
Economic significance vs statistical significance
It is easy to use the results in the regression output above to decide (based on p-values or t-statistics) whether the independent variables have a statistically significant relation with the dependent variable. But it is not clear whether these relations are large or small in magnitude (does investment have a large impact on future profitability? larger than leverage?). That is what we mean by economic significance.

To help ease the interpretation of these economic magnitudes, we generally standardize all the variables in the regression by subtracting their mean and dividing by their standard deviation (see below). After doing this, the regression coefficient on any independent variable X tells us by how many standard deviations we expect the dependent variable Y to move when X changes by one standard deviation.

So after the normalization, the X variables with larger coefficients (in absolute value) have a larger economic impact on the Y variable.
# Create list of names we want to give to the normalized versions of these variables
normalized_vars = ['n_' + x for x in main_vars]
normalized_vars
['n_future_roa', 'n_cash', 'n_leverage', 'n_investment']
# Create normalized variables
comp[normalized_vars] = (comp[main_vars_wins] - comp[main_vars_wins].mean()) / comp[main_vars_wins].std()
# Check that all means are 0 and all std deviations are 1
comp[normalized_vars].describe()
|       | n_future_roa | n_cash | n_leverage | n_investment |
|-------|--------------|--------|------------|--------------|
| count | 2.129550e+05 | 2.364170e+05 | 2.359690e+05 | 2.074580e+05 |
| mean  | 4.288606e-15 | -1.581191e-14 | -2.359298e-14 | 2.044987e-15 |
| std   | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
| min   | -5.656264e+00 | -7.966420e-01 | -9.061689e-01 | -4.625934e+00 |
| 25%   | 5.906184e-02 | -6.817643e-01 | -8.650839e-01 | -2.364965e-01 |
| 50%   | 2.524840e-01 | -4.287752e-01 | -3.294019e-01 | -1.252835e-01 |
| 75%   | 4.173736e-01 | 2.767601e-01 | 5.675461e-01 | 2.116544e-01 |
| max   | 1.218065e+00 | 3.454800e+00 | 3.476980e+00 | 3.877626e+00 |
# Using winsorized, then normalized data
results_norm = sm.OLS(endog = comp['n_future_roa'],
                      exog = comp[['const','n_cash','n_leverage','n_investment']],
                      missing = 'drop').fit()
print(results_norm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: n_future_roa R-squared: 0.096
Model: OLS Adj. R-squared: 0.096
Method: Least Squares F-statistic: 6524.
Date: Fri, 25 Feb 2022 Prob (F-statistic): 0.00
Time: 14:32:28 Log-Likelihood: -2.4834e+05
No. Observations: 185315 AIC: 4.967e+05
Df Residuals: 185311 BIC: 4.967e+05
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 0.0037 0.002 1.737 0.082 -0.000 0.008
n_cash -0.3195 0.002 -134.431 0.000 -0.324 -0.315
n_leverage -0.0558 0.002 -23.971 0.000 -0.060 -0.051
n_investment 0.0576 0.002 25.736 0.000 0.053 0.062
==============================================================================
Omnibus: 126392.896 Durbin-Watson: 1.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1848286.569
Skew: -3.171 Prob(JB): 0.00
Kurtosis: 17.112 Cond. No. 1.40
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
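The normalized coefficients show that cash has by far the largest economic impact: a one-standard-deviation increase in cash holdings predicts a 0.32 standard deviation drop in future ROA, versus less than 0.06 standard deviations for leverage or investment. As a consistency check (a sketch using the results objects created above), each normalized coefficient can also be recovered from the winsorized regression by rescaling with the ratio of standard deviations:

# Normalizing is an affine rescaling, so beta_norm = beta_wins * std(X) / std(Y)
beta_wins = results_wins.params['w_cash']                   # -0.3712
scale = comp['w_cash'].std() / comp['w_future_roa'].std()
print(beta_wins * scale)   # approx. -0.3195, matching the coefficient on n_cash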
Multicollinearity
One common way that multicollinearity arises is when two or more of your independent variables (X) are very highly correlated (correlation close to 1). The usual way to deal with this issue is to calculate the correlation matrix between all the variables in your study, identify which groups of variables are highly correlated with each other, and then simply drop all but one variable from each such group.

Below, we artificially create this problem in our example application by introducing into our regression a variable that equals the leverage variable times 100. This will have a correlation of 1 with the leverage variable. However, as we’ll see below, “statsmodels” will NOT give us an error. So it’s up to us to make sure that we don’t have this problem in our data, by always looking at the correlation matrix of our data.
# Add a variable that is perfectly collinear with leverage
comp['lev100'] = comp['n_leverage']*100
# Run regression with multicollinearity problem
X = comp[['const','n_cash','n_leverage','n_investment','lev100']]
results_mc = sm.OLS(comp['n_future_roa'], X, missing = 'drop').fit()
print(results_mc.summary())
OLS Regression Results
==============================================================================
Dep. Variable: n_future_roa R-squared: 0.096
Model: OLS Adj. R-squared: 0.096
Method: Least Squares F-statistic: 6524.
Date: Fri, 25 Feb 2022 Prob (F-statistic): 0.00
Time: 14:32:28 Log-Likelihood: -2.4834e+05
No. Observations: 185315 AIC: 4.967e+05
Df Residuals: 185311 BIC: 4.967e+05
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 0.0037 0.002 1.737 0.082 -0.000 0.008
n_cash -0.3195 0.002 -134.431 0.000 -0.324 -0.315
n_leverage -5.583e-06 2.33e-07 -23.971 0.000 -6.04e-06 -5.13e-06
n_investment 0.0576 0.002 25.736 0.000 0.053 0.062
lev100 -0.0006 2.33e-05 -23.971 0.000 -0.001 -0.001
==============================================================================
Omnibus: 126392.896 Durbin-Watson: 1.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1848286.569
Skew: -3.171 Prob(JB): 0.00
Kurtosis: 17.112 Cond. No. 2.61e+17
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.56e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Note how the coefficient on n_leverage has changed: the effect of leverage is now split arbitrarily between n_leverage and lev100. Also, look at Notes [2] above.
# Check the correlations to see which variable to drop
X.corr()
|              | const | n_cash | n_leverage | n_investment | lev100 |
|--------------|-------|--------|------------|--------------|--------|
| const        | NaN | NaN | NaN | NaN | NaN |
| n_cash       | NaN | 1.000000 | -0.315295 | -0.070145 | -0.315295 |
| n_leverage   | NaN | -0.315295 | 1.000000 | 0.074024 | 1.000000 |
| n_investment | NaN | -0.070145 | 0.074024 | 1.000000 | 0.074024 |
| lev100       | NaN | -0.315295 | 1.000000 | 0.074024 | 1.000000 |
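Besides the correlation matrix, another common diagnostic (a sketch below, not part of the original notes) is the variance inflation factor from statsmodels, which flags a regressor whenever the other regressors jointly explain it well, so it also catches the linear-combination case discussed next:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF = 1/(1 - R^2) from regressing each X on all the other Xs.
# Values above ~10 are a common warning sign; perfectly collinear
# pairs like n_leverage and lev100 produce huge (or infinite) VIFs.
exog = comp[['const','n_cash','n_leverage','n_investment','lev100']].dropna()
for i, name in enumerate(exog.columns):
    if name != 'const':
        print(name, variance_inflation_factor(exog.values, i))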
Multicollinearity can arise even if a “linear combination” (a weighted sum or difference) of our variables is highly correlated with some other variable in the regression. To see this in action, we will add to our explanatory variables a variable called illiquid, which measures the non-cash assets of the firm (divided by total assets). In this case, the sum of cash and illiquid will equal 1 at all times, which is equal to another explanatory variable in our regression: the constant term.
comp['illiquid'] = (comp['at'] - comp['che']) / comp['at']
# Run regression with multicollinearity problem
X = comp[['const','cash','leverage','investment','illiquid']]
results_mc = sm.OLS(comp['future_roa'], X, missing = 'drop').fit()
print(results_mc.summary())
OLS Regression Results
==============================================================================
Dep. Variable: future_roa R-squared: 0.001
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 65.43
Date: Fri, 25 Feb 2022 Prob (F-statistic): 2.82e-42
Time: 14:32:29 Log-Likelihood: -3.4663e+05
No. Observations: 185315 AIC: 6.933e+05
Df Residuals: 185311 BIC: 6.933e+05
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0712 0.005 -15.175 0.000 -0.080 -0.062
cash -0.1568 0.011 -14.354 0.000 -0.178 -0.135
leverage -0.1111 0.018 -6.133 0.000 -0.147 -0.076
investment -0.0013 0.000 -3.528 0.000 -0.002 -0.001
illiquid 0.0856 0.008 11.360 0.000 0.071 0.100
==============================================================================
Omnibus: 560269.038 Durbin-Watson: 0.913
Prob(Omnibus): 0.000 Jarque-Bera (JB): 887102384117.364
Skew: 43.457 Prob(JB): 0.00
Kurtosis: 10721.226 Cond. No. 7.92e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.74e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Again, we did not get an error, but the results above cannot be trusted. To see this, you can check Notes [2] above, but you can also print out the correlation matrix.
X.corr()
|            | const | cash | leverage | investment | illiquid |
|------------|-------|------|----------|------------|----------|
| const      | NaN | NaN | NaN | NaN | NaN |
| cash       | NaN | 1.000000 | -0.244805 | -0.013678 | -1.000000 |
| leverage   | NaN | -0.244805 | 1.000000 | 0.000063 | 0.244805 |
| investment | NaN | -0.013678 | 0.000063 | 1.000000 | 0.013678 |
| illiquid   | NaN | -1.000000 | 0.244805 | 0.013678 | 1.000000 |
# Verify that cash and illiquid always sum to 1 (the constant term)
thesum = comp['cash'] + comp['illiquid']
thesum
188730 1.0
188550 1.0
188566 1.0
188567 1.0
188568 1.0
...
493016 1.0
493017 1.0
493020 1.0
493021 1.0
493024 1.0
Length: 237017, dtype: float64
Again, dropping one of the problem variables (“cash” or “illiquid”) would solve our problem.
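For completeness, a minimal sketch of the fix: dropping illiquid (or cash) removes the exact linear dependence, and the regression collapses back to the baseline specification (results_base) we ran earlier.

# Dropping 'illiquid' restores a well-conditioned design matrix
X_fixed = comp[['const','cash','leverage','investment']]
results_fixed = sm.OLS(comp['future_roa'], X_fixed, missing = 'drop').fit()
print(results_fixed.params)   # matches results_base above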